Opportunities for research data discovery and reuse - lessons learned from 20 years of geospatial data platform evolution

Overview

Building on three decades of experience in data management, analysis, and visualization - particularly database design and geospatial data management - and more than 20 years of web application development focused on web-based data discovery and access tools, this presentation illustrates lessons learned from several generations of applications built to enable effective discovery, access, and use of increasingly large collections of geospatial (and, ultimately, more general) data. Highlighting these lessons can inform decisions about the strategies and tools developed and selected to enable more effective research data preservation, discovery, and sharing in our current context of increased attention to maximizing the impact of investments in research data creation and documentation.

 

Introduction

 

The initiative begun in the mid-1990s to establish a US National Spatial Data Infrastructure (NSDI) shows some striking similarities to the development and expansion of requirements for increased public access to research data as a product of funded research projects and in association with publications. In the nearly 25 years since the initiation of the NSDI program, the evolution of the NSDI - both as a national program and as the collection of local repositories that constitute the network - can provide insights into how a still-emerging system for preserving and sharing research data may be developed and grown.

Critical questions of system interoperability, adoption of standards, and architectural models are as relevant today in the context of our emerging network of data repositories as they were 25 years ago when a distributed network of geospatial data providers was envisioned as the national-scale NSDI.

 

Development of the NSDI

 

In April 1994 President Clinton signed Executive Order 12906, entitled Coordinating Geographic Data Acquisition and Access: The National Spatial Data Infrastructure (link).

Through this Executive Order the National Geospatial Data Clearinghouse was established as “a distributed network of geospatial data producers, managers, and users linked electronically.”

Key elements of the clearinghouse model included:

 

 

The initial release of the FGDC Clearinghouse consisted of a simple web interface and an underlying Z39.50 search service that executed each submitted search across the registered clearinghouse nodes. Each clearinghouse node ran its own local search service - typically based on the ISite server platform (site), extended to support geospatial searches within the Z39.50 standard - over locally produced FGDC metadata records.
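The fan-out-and-merge pattern behind this distributed search can be sketched as follows. This is a minimal illustration only: the node contents and the in-memory string matching stand in for real Z39.50 queries against ISite nodes, and the node names and records are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def search_node(node, query):
    # Stand-in for a Z39.50 query: each real node ran its own search
    # service (typically an extended ISite server) over local FGDC metadata.
    return [title for title in node["records"] if query.lower() in title.lower()]

def distributed_search(nodes, query):
    """Send the same query to every registered node and merge the hits."""
    with ThreadPoolExecutor() as pool:
        per_node_hits = pool.map(lambda node: search_node(node, query), nodes)
    merged = []
    for hits in per_node_hits:
        merged.extend(hits)
    return merged

nodes = [
    {"name": "node-a", "records": ["NM Elevation DEM", "NM Roads"]},
    {"name": "node-b", "records": ["Rio Grande elevation contours"]},
]
print(distributed_search(nodes, "elevation"))
# ['NM Elevation DEM', 'Rio Grande elevation contours']
```

In the real network, of course, nodes could be slow or unreachable; a production search layer also needs per-node timeouts and partial-result handling.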

At its peak (in 2002-2003), the FGDC clearinghouse node network included over 250 contributors, including a significant number of international nodes - transforming the network from an implementation of the NSDI into a Global Spatial Data Infrastructure (GSDI).

2003 - A Platform for Discovering Data and Services

 

In 2003 the Geospatial One-Stop (GOS) platform was released as a successor to the FGDC Clearinghouse. GOS provided a number of methods for integrating metadata into its search platform, including continued support for Z39.50 ISite nodes while also adopting a model for harvesting FGDC metadata records from contributing organizations. As part of the registration process organizations could also register Open Geospatial Consortium map services published using the Web Map Service (WMS) standard. This registration process allowed GOS both to support discovery of these services and to use them so that users could view data hosted by providers within the GOS platform without having to download the underlying data.
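A registered WMS endpoint is consumed through simple keyword-value HTTP requests. The sketch below builds a GetMap request URL; the parameter names follow the OGC WMS 1.1.1 specification, while the endpoint URL and layer name are invented for illustration.

```python
from urllib.parse import urlencode

def wms_getmap_url(base_url, layer, bbox, width=512, height=512):
    """Build a WMS 1.1.1 GetMap request URL (parameter names per the OGC spec)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",                            # accept the server's default style
        "SRS": "EPSG:4326",                      # geographic lat/lon coordinates
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint; a real GOS-registered provider would publish its own URL.
url = wms_getmap_url("https://example.org/wms", "nm_roads",
                     (-109.05, 31.33, -103.0, 37.0))
print(url)
```

Because the whole request is expressed in the URL, a portal like GOS can embed such requests directly in its map viewer and render provider-hosted data on the fly.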

2009 - A new platform

 

As part of the Obama administration’s Open Government Initiative the Data.gov web portal was established as a central access point for datasets produced by executive branch federal agencies. As part of the establishment of the Data.gov platform, the geospatial metadata that had been searchable through GOS were migrated into the new platform and made available through the catalog.data.gov search interface.

 

The catalog.data.gov search interface provides a much richer set of browse and search tools than GOS had offered, and presents much more detailed information about available data formats and supported services for registered datasets.

NSDI Summary

 

Getting Local - the New Mexico Resource Geographic Information System

From the Earth Data Analysis Center web site's New Mexico Resource Geographic Information System information page. The current RGIS Clearinghouse can be accessed at http://rgis.unm.edu

The New Mexico Resource Geographic Information System (RGIS) was designated as the state digital geospatial data clearinghouse by the New Mexico Legislature in 2013. NM RGIS is the state’s only geospatial data clearinghouse and has been hosted and managed by EDAC for over 23 years.

The RGIS Program was created by the NM Legislature in 1988 and was designed and developed by EDAC and the Bureau of Business and Economic Research (BBER) at the University of New Mexico.

The RGIS data clearinghouse hosts a wide variety of geospatial data for New Mexico. Data sets available for download include political and administrative boundaries, place names and locations, census data (current and historical), 28 years of digital orthophotography, 80 years of historic aerial photography, satellite imagery, elevation data, transportation data, wildfire boundaries and natural resource data.

The data are publicly available at http://rgis.unm.edu. Data sets that are too large for direct download are available for purchase. To place a custom order, contact the clearinghouse by phone at (505) 277-3622 or by email at clearinghouse@edac.unm.edu.

1998-2001 - An online brochure

 

The online presence of RGIS became available in 1998 with the creation of the first RGIS website as an online information resource about the program and the data products available for order through the clearinghouse’s office.

In conjunction with the development of the RGIS website, the program - an early collaborator in the FGDC's NSDI initiatives - established an FGDC Clearinghouse node as part of the nationwide clearinghouse network. This clearinghouse node was based on a collection of FGDC metadata records that had been created with support from the FGDC's Cooperative Agreement Program (CAP).

 

 

2001-2011 - An online catalog

 

In 2001 the original RGIS website was replaced with a dynamic web site that retained the earlier program information but also allowed users to browse the system, perform basic searches to identify datasets of interest, preview some datasets, and download those data in available pre-processed formats.

 

 

2008 - Targeted integration of OGC Services

 

In 2008 the RGIS web site was enhanced to provide dynamically generated Open Geospatial Consortium Web Map Services (WMS) for a subset of newly added imagery and other datasets within the system. This was intended both to support direct user interaction with the data content of the system without requiring download of the underlying data, and to provide dynamically generated previews that were added to web-based views of the metadata associated with data in the system.

2011-Present - Adoption of a full Service-Oriented Architecture

 

In 2011 a complete redesign of the RGIS system was released, based on a tiered Service-Oriented Architecture (SOA) developed with support from the National Science Foundation's EPSCoR program, the NM RGIS program, and NASA. This system was designed to provide a degree of separation between the data management components of the system and the interfaces through which custom client applications - such as the RGIS clearinghouse web site, the EPSCoR data portal, a member node of the DataONE network, and the processes that populate the web accessible folders upon which the RGIS Data.gov materials are based - interact with the data managed in the underlying platform. The service interfaces use the REST web service architecture over the HTTP protocol. The services provided by the platform include standards-based OGC Web Map, Web Feature, and Web Coverage Services (WMS, WFS, and WCS, respectively), and custom services that enable data discovery and data and metadata download.
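The separation that the tiered design provides can be illustrated with a minimal REST-style dispatcher: clients address stable resource paths, while the data tier behind the service layer can change freely. The paths, identifiers, and record fields below are invented for illustration, not GSToRE's actual API.

```python
# In-memory stand-in for the data management tier; clients never touch it
# directly, only the service layer exposed by handle().
DATASTORE = {
    "ds42": {"title": "NM orthophotography 2009", "formats": ["GeoTIFF", "KML"]},
    "ds43": {"title": "NM wildfire boundaries", "formats": ["Shapefile"]},
}

def handle(method, path):
    """Route a REST-style request to the data tier; returns (status, body)."""
    parts = [p for p in path.strip("/").split("/") if p]
    if method == "GET" and parts == ["datasets"]:
        return 200, sorted(DATASTORE)          # discovery: list dataset ids
    if method == "GET" and len(parts) == 2 and parts[0] == "datasets":
        record = DATASTORE.get(parts[1])       # read: fetch one record
        return (200, record) if record else (404, None)
    return 405, None                           # everything else is unsupported here

status, body = handle("GET", "/datasets/ds42")
print(status, body["title"])   # 200 NM orthophotography 2009
```

Because every client - a web site, a data portal, a harvesting process - goes through the same small set of resource-oriented endpoints, the storage layer can be reorganized without breaking any of them.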

 

 

Under the Hood - The GSToRE Platform

 

The Geographic Storage, Transformation and Retrieval Engine (GSToRE) is the data management and service platform that manages hundreds of thousands of individual data objects and their associated metadata, and provides web-accessible services that support the full suite of operations - Create, Read, Update, and Delete - on data objects and metadata. Through these services a system (local or remote) can, with appropriate access permissions, interact with the platform to discover data, access data in a variety of supported formats, make use of geospatial data and visualization services defined by the OGC, and view dataset documentation in a variety of formats. The services provided by the platform also enable the development of client applications that provide different target audiences (whether human or machine) with customized interfaces specifically designed to meet their particular needs. Current client applications enabled by the GSToRE platform include:

 

 
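Serving "dataset documentation in a variety of formats" amounts to keeping one canonical metadata record and rendering it on demand. The sketch below illustrates that idea; the record, format keys, and renderers are invented for illustration and are not GSToRE's internal model.

```python
import json

# Canonical metadata record held once by the platform (illustrative fields).
record = {
    "title": "NM Elevation 10m DEM",
    "abstract": "Statewide 10m digital elevation model.",
}

def as_dublin_core(rec):
    # Minimal Dublin Core XML fragment derived from the canonical record.
    return (f"<dc:title>{rec['title']}</dc:title>"
            f"<dc:description>{rec['abstract']}</dc:description>")

def as_json(rec):
    return json.dumps(rec)

RENDERERS = {"dc": as_dublin_core, "json": as_json}

def render_metadata(rec, fmt):
    """Serve the same canonical record in the requested output format."""
    return RENDERERS[fmt](rec)

print(render_metadata(record, "dc"))
```

Adding a new documentation format (for example ISO 19115 or FGDC CSDGM) then means adding one renderer, not duplicating the metadata itself.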

GSToRE Architectural Model

 

Summary

Increasing demands lead to architectural changes

 

 

Lessons Learned

Applying These Lessons to the Current Research Data Repository Landscape

 

A cursory examination of the Registry of Research Data Repositories (re3data.org) suggests that while a large number of data repositories have been created and registered (2,036), only a relatively small number provide an Application Programming Interface:

A large number of metadata standards:

And a wide range of data types:

These characteristics highlight areas for improvement (or criteria for selection) for the large universe of data repositories:

Questions